Monitoring and understanding a human's emotional state plays a key role in current and forthcoming computational technologies. At the same time, such monitoring and analysis should be as unobtrusive as possible, since the digital world has been smoothly adopted into everyday life activities. In this framework, and within the domain of assessing humans' affective state during educational training, the prevailing approach is to use sensory equipment that allows observation without any kind of direct contact. Thus, in this work, we focus on human emotion recognition from audio stimuli (i.e., human speech) using a novel approach based on a computer-vision-inspired methodology, namely the bag-of-visual-words method, applied to audio segment spectrograms. The latter constitute a visual representation of the corresponding audio segment and may be analyzed with well-known traditional computer vision techniques, such as extraction of speeded-up robust features (SURF), construction of a visual vocabulary, quantization into a set of visual words, and image histogram construction. As a last step, support vector machine (SVM) classifiers are trained on the resulting histograms; a sketch of this pipeline is given below. Finally, to further assess the generalization of the proposed approach, we utilize publicly available datasets in several human languages and perform cross-language experiments on both acted and real-life speech.
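To make the described pipeline concrete, the following minimal sketch outlines one possible realization in Python. The library choices (librosa for spectrograms, opencv-contrib for SURF, scikit-learn for clustering and SVM training), the vocabulary size, and the file names are illustrative assumptions rather than the authors' actual implementation.

```python
# A minimal sketch of the spectrogram bag-of-visual-words pipeline described in
# the abstract. All library choices, the vocabulary size, and the file paths are
# illustrative assumptions, not the authors' implementation.
import numpy as np
import librosa
import cv2
from sklearn.cluster import KMeans
from sklearn.svm import SVC

def spectrogram_image(path, n_fft=512, hop=128):
    """Render an audio segment as an 8-bit grayscale spectrogram image."""
    y, sr = librosa.load(path, sr=None)
    S_db = librosa.amplitude_to_db(
        np.abs(librosa.stft(y, n_fft=n_fft, hop_length=hop)), ref=np.max)
    # Scale to [0, 255] so standard CV feature detectors can operate on it.
    return cv2.normalize(S_db, None, 0, 255, cv2.NORM_MINMAX).astype(np.uint8)

# SURF is patented and shipped only in opencv-contrib builds compiled with
# OPENCV_ENABLE_NONFREE; substitute cv2.SIFT_create() or cv2.ORB_create() if needed.
surf = cv2.xfeatures2d.SURF_create(hessianThreshold=400)

def surf_descriptors(img):
    """Extract SURF descriptors (64-dimensional by default) from an image."""
    _, desc = surf.detectAndCompute(img, None)
    return desc if desc is not None else np.empty((0, 64), np.float32)

# Hypothetical training data: paths to speech segments and their emotion labels.
train_paths = ["happy_01.wav", "angry_01.wav", "sad_01.wav"]
train_labels = ["happy", "angry", "sad"]

# Step 1: build the visual vocabulary by clustering the SURF descriptors pooled
# over all training spectrograms (k = 200 visual words is an assumed value).
all_desc = np.vstack([surf_descriptors(spectrogram_image(p)) for p in train_paths])
vocab = KMeans(n_clusters=200, random_state=0).fit(all_desc)

def bovw_histogram(img):
    """Quantize an image's descriptors into visual words and histogram them."""
    words = vocab.predict(surf_descriptors(img))
    hist, _ = np.histogram(words, bins=np.arange(vocab.n_clusters + 1))
    return hist / max(hist.sum(), 1)  # L1-normalize against varying segment length

# Step 2: train an SVM classifier on the bag-of-visual-words histograms.
X = np.array([bovw_histogram(spectrogram_image(p)) for p in train_paths])
clf = SVC(kernel="rbf").fit(X, train_labels)
```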